Search | VHL Regional Portal

1.

A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records.

Chiang, Chia-Chun; Luo, Man; Dumkrieger, Gina; Trivedi, Shubham; Chen, Yi-Chieh; Chao, Chieh-Ju; Schwedt, Todd J; Sarker, Abeed; Banerjee, Imon.

Headache ; 64(4): 400-409, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38525734

ABSTRACT

OBJECTIVE: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free-text clinical notes. BACKGROUND: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to the variations and inconsistencies in documentation by clinicians, significant challenges exist to accurately extract headache frequency from the electronic health record (EHR) by traditional NLP algorithms. METHODS: This was a retrospective cross-sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model zero-shot, (3) GPT-2 QA model few-shot training fine-tuned on clinical notes, and (4) GPT-2 generative model few-shot training fine-tuned on clinical notes to generate the answer by considering the context of included text. RESULTS: The mean (standard deviation) headache frequency of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT-2 generative model was the best-performing model with an accuracy of 0.92 (0.91, 0.93, 95% confidence interval [CI]) and R2 score of 0.89 (0.87, 0.90, 95% CI), and all GPT-2-based models outperformed the ClinicalBERT model in terms of exact matching accuracy. 
Although the ClinicalBERT regression model had the lowest accuracy of 0.27 (0.26, 0.28), it demonstrated a high R2 score of 0.88 (0.85, 0.89), suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and the R2 score was higher than the GPT-2 QA zero-shot model or GPT-2 QA model few-shot training fine-tuned model. CONCLUSION: We developed a robust information extraction model based on a state-of-the-art large language model, a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R2 score. It overcame several challenges related to different ways clinicians document headache frequency that were not easily achieved by traditional NLP models. We also showed that GPT-2-based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code with open-source license of community use in GitHub. Additional fine-tuning of the algorithm might be required when applied to different health-care systems for various clinical use cases.
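
The contrast reported above between exact-match accuracy and the R2 score can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the five example frequencies are hypothetical, and the 3-day tolerance simply mirrors the range mentioned in the abstract.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference value exactly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_tolerance(y_true, y_pred, tol):
    """Fraction of predictions within +/- tol days of the reference value."""
    return sum(int(abs(t - p) <= tol) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical headache-day counts per month (not data from the study).
y_true = [2, 8, 15, 20, 30]
# Hypothetical regression outputs, each off by exactly one day.
y_pred = [3, 7, 16, 19, 29]

print(exact_match_accuracy(y_true, y_pred))
print(within_tolerance(y_true, y_pred, 3))
print(r2_score(y_true, y_pred))
```

With these made-up values every prediction misses by one day, so exact-match accuracy is zero while the R2 score stays close to 1, the same pattern the abstract describes for the regression model.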

Subjects

Electronic Health Records, Natural Language Processing, Humans, Retrospective Studies, Cross-Sectional Studies, Male, Female, Headache, Adult, Middle Aged, Algorithms

2.

Area-level Measures of the Social Environment: Operationalization, Pitfalls, and Ways Forward.

Helbich, Marco; Zeng, Yi; Sarker, Abeed.

Curr Top Behav Neurosci ; 2024 Mar 08.

Article in English | MEDLINE | ID: mdl-38453766

ABSTRACT

People's mental health is intertwined with the social environment in which they reside. This chapter explores approaches for quantifying the area-level social environment, focusing specifically on socioeconomic deprivation and social fragmentation. We discuss census data and administrative units, egocentric and ecometric approaches, neighborhood audits, social media data, and street view-based assessments. We close the chapter by discussing possible paths forward from associations between social environments and health to establishing causality, including longitudinal research designs and time-series social environmental indices.

3.

Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium.

Klein, Ari Z; Banda, Juan M; Guo, Yuting; Schmidt, Ana Lucia; Xu, Dongfang; Flores Amaro, Ivan; Rodriguez-Esteban, Raul; Sarker, Abeed; Gonzalez-Hernandez, Graciela.

J Am Med Inform Assoc ; 31(4): 991-996, 2024 Apr 03.

Article in English | MEDLINE | ID: mdl-38218723

ABSTRACT

OBJECTIVE: The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants' systems, and the performance results. METHODS: The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). RESULTS: In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. CONCLUSION: To facilitate future work, the datasets-a total of 61 353 posts-will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.

Subjects

Social Media, Humans, Data Mining/methods, Computer Neural Networks, Natural Language Processing, Machine Learning

4.

Detection of Medication Mentions and Medication Change Events in Clinical Notes Using Transformer-Based Models.

Guo, Yuting; Ge, Yao; Sarker, Abeed.

Stud Health Technol Inform ; 310: 685-689, 2024 Jan 25.

Article in English | MEDLINE | ID: mdl-38269896

ABSTRACT

In this paper, we address the related tasks of medication extraction, event classification, and context classification from clinical text. The data for the tasks were obtained from the National Natural Language Processing (NLP) Clinical Challenges (n2c2) Track 1. We developed a named entity recognition (NER) model based on BioClinicalBERT and applied a dictionary-based fuzzy matching mechanism to identify the medication mentions in clinical notes. We developed a unified model architecture for event classification and context classification. The model used two pre-trained models-BioClinicalBERT and RoBERTa to predict the class, separately. Additionally, we applied an ensemble mechanism to combine the predictions of BioClinicalBERT and RoBERTa. For event classification, our best model achieved 0.926 micro-averaged F1-score, 5% higher than the baseline model. The shared task released the data in different stages during the evaluation phase. Our system consistently ranked among the top 10 for Releases 1 and 2.
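
The abstract does not say how the two models' predictions were combined, so the following is only a sketch of one common ensemble mechanism: averaging per-class probabilities and taking the highest-scoring class. The function name, the class labels, and the probability values are all hypothetical.

```python
def average_ensemble(prob_dicts):
    """Average per-class probabilities from several models and pick the top class.

    prob_dicts: list of {class_label: probability} dictionaries, one per model,
    all sharing the same class labels.
    """
    classes = list(prob_dicts[0].keys())
    averaged = {
        c: sum(p[c] for p in prob_dicts) / len(prob_dicts) for c in classes
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Hypothetical outputs for one medication mention (labels are placeholders).
model_a_probs = {"Start": 0.6, "Stop": 0.4}  # stands in for one classifier
model_b_probs = {"Start": 0.3, "Stop": 0.7}  # stands in for the other

best, averaged = average_ensemble([model_a_probs, model_b_probs])
print(best, averaged)
```

Averaging lets a confident model outvote a hesitant one; majority voting or weighted averaging would be equally plausible readings of "ensemble mechanism".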

Subjects

Electric Power Supplies, Natural Language Processing, Psychological Recognition

5.

Data Augmentation with Nearest Neighbor Classifier for Few-Shot Named Entity Recognition.

Ge, Yao; Al-Garadi, Mohammed Ali; Sarker, Abeed.

Stud Health Technol Inform ; 310: 690-694, 2024 Jan 25.

Article in English | MEDLINE | ID: mdl-38269897

ABSTRACT

Few-shot learning (FSL) is a category of machine learning models that are designed with the intent of solving problems that have small amounts of labeled data available for training. FSL research progress in natural language processing (NLP), particularly within the medical domain, has been notably slow, primarily due to greater difficulties posed by domain-specific characteristics and data sparsity problems. We explored the use of novel methods for text representation and encoding combined with distance-based measures for improving FSL entity detection. In this paper, we propose a data augmentation method to incorporate semantic information from medical texts into the learning process and combine it with a nearest-neighbor classification strategy for predicting entities. Experiments performed on five biomedical text datasets demonstrate that our proposed approach often outperforms other approaches.
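
A nearest-neighbor classification strategy of the general kind named above can be sketched as below. This is an assumption-laden illustration rather than the authors' method: the two-dimensional vectors stand in for token representations, the labels are placeholders, and cosine similarity is just one possible distance-based measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(query, support):
    """Return the label of the support example most similar to the query vector.

    support: list of (vector, label) pairs, i.e. the few labeled examples.
    """
    best_vector, best_label = max(
        support, key=lambda item: cosine_similarity(query, item[0])
    )
    return best_label


# Hypothetical few-shot support set: toy vectors with placeholder entity labels.
support_set = [
    ([1.0, 0.0], "DRUG"),
    ([0.0, 1.0], "O"),
]
query_vector = [0.9, 0.1]  # hypothetical representation of an unlabeled token

predicted = nearest_neighbor_label(query_vector, support_set)
print(predicted)
```

In this setting, data augmentation would enlarge `support_set` with additional labeled vectors before the lookup.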

Subjects

Intention, Names, Cluster Analysis, Machine Learning, Natural Language Processing

6.

Overview of the 8th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at the AMIA 2023 Annual Symposium.

Klein, Ari Z; Banda, Juan M; Guo, Yuting; Schmidt, Ana Lucia; Xu, Dongfang; Amaro, Jesus Ivan Flores; Rodriguez-Esteban, Raul; Sarker, Abeed; Gonzalez-Hernandez, Graciela.

medRxiv ; 2023 Nov 08.

Article in English | MEDLINE | ID: mdl-37986776

ABSTRACT

The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of five tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). In total, 29 teams registered, representing 18 countries. In this paper, we present the annotated corpora, a technical summary of the systems, and the performance results. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. To facilitate future work, the datasets-a total of 61,353 posts-will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.

7.

Public Perceptions About Monkeypox on Twitter: Thematic Analysis.

Leslie, Abimbola; Okunromade, Omolola; Sarker, Abeed.

JMIR Form Res ; 7: e48710, 2023 Nov 03.

Article in English | MEDLINE | ID: mdl-37921866

ABSTRACT

BACKGROUND: Social media has emerged as an important source of information generated by large segments of the population, which can be particularly valuable during infectious disease outbreaks. The recent outbreak of monkeypox led to an increase in discussions about the topic on social media, thus presenting the opportunity to conduct studies based on the generated data. OBJECTIVE: By analyzing posts from Twitter (subsequently rebranded X), we aimed to identify the topics of public discourse as well as knowledge and opinions about the monkeypox virus during the 2022 outbreak. METHODS: We collected data from Twitter focusing on English-language posts containing key phrases like "monkeypox," "mpoxvirus," and "monkey pox," as well as their hashtag equivalents from August to October 2022. We preprocessed the data using natural language processing to remove duplicates and filter out noise. We then selected a random sample from the collected posts. Three annotators reviewed a sample of the posts and created a guideline for coding based on discussion. Finally, the annotators analyzed, coded, and manually categorized them first into topics and then into coarse-grained themes. Disagreements were resolved via discussion among all authors. RESULTS: A total of 128,615 posts were collected over a 3-month period, and 200 tweets were selected and included for manual analyses. The following 8 themes were generated from the Twitter posts: monkeypox doubts, media, monkeypox transmission, effect of monkeypox, knowledge of monkeypox, politics, monkeypox vaccine, and general comments. The most common themes from our study were monkeypox doubts and media, each accounting for 22% (44/200) of the posts. The posts represented a mix of useful information reflecting emerging knowledge on the topic as well as misinformation. CONCLUSIONS: Social networks, such as Twitter, are useful sources of information in the early stages of outbreaks. 
Close to real-time identification and analyses of misinformation may help authorities take the necessary steps in a timely manner.

8.

Self-reported Xylazine Experiences: A Mixed-methods Study of Reddit Subscribers.

Spadaro, Anthony; O'Connor, Karen; Lakamana, Sahithi; Sarker, Abeed; Wightman, Rachel; Love, Jennifer S; Perrone, Jeanmarie.

J Addict Med ; 17(6): 691-694, 2023.

Article in English | MEDLINE | ID: mdl-37934533

ABSTRACT

OBJECTIVES: Xylazine is an α2-agonist increasingly prevalent in the illicit drug supply. Our objectives were to curate information about xylazine through social media from people who use drugs (PWUDs). Specifically, we sought to answer the following: (1) What are the demographics of Reddit subscribers reporting exposure to xylazine? (2) Is xylazine a desired additive? And (3) what adverse effects of xylazine are PWUDs experiencing? METHODS: Natural language processing (NLP) was used to identify mentions of "xylazine" from posts by Reddit subscribers who also posted on drug-related subreddits. Posts were qualitatively evaluated for xylazine-related themes. A survey was developed to gather additional information about the Reddit subscribers. This survey was posted on subreddits that were identified by NLP to contain xylazine-related discussions from March 2022 to October 2022. RESULTS: Seventy-six posts were extracted via NLP from 765,616 posts by 16,131 Reddit subscribers (January 2018 to August 2021). People on Reddit described xylazine as an unwanted adulterant in their opioid supply. Sixty-one participants completed the survey. Of those who disclosed their location, 25 of 50 participants (50%) reported locations in the Northeastern United States. The most common route of xylazine use was intranasal use (57%). Thirty-one of 59 (53%) reported experiencing xylazine withdrawal. Frequent adverse events reported were prolonged sedation (81%) and increased skin wounds (43%). CONCLUSIONS: Among respondents on these Reddit forums, xylazine seems to be an unwanted adulterant. People who use drugs may be experiencing adverse effects such as prolonged sedation and xylazine withdrawal. This seemed to be more common in the Northeast.

Subjects

Illicit Drugs, Xylazine, Humans, Self Report, Opioid Analgesics, Antisocial Personality Disorder

9.

A Large Language Model-Based Generative Natural Language Processing Framework Finetuned on Clinical Notes Accurately Extracts Headache Frequency from Electronic Health Records.

Chiang, Chia-Chun; Luo, Man; Dumkrieger, Gina; Trivedi, Shubham; Chen, Yi-Chieh; Chao, Chieh-Ju; Schwedt, Todd J; Sarker, Abeed; Banerjee, Imon.

medRxiv ; 2023 Oct 03.

Article in English | MEDLINE | ID: mdl-37873417

ABSTRACT

Background: Headache frequency, defined as the number of days with any headache in a month (or four weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to the variations and inconsistencies in documentation by clinicians, significant challenges exist to accurately extract headache frequency from the electronic health record (EHR) by traditional natural language processing (NLP) algorithms. Methods: This was a retrospective cross-sectional study with human subjects identified from three tertiary headache referral centers- Mayo Clinic Arizona, Florida, and Rochester. All neurology consultation notes written by more than 10 headache specialists between 2012 to 2022 were extracted and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model (2) Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) Model zero-shot (3) GPT-2 QA model few-shot training fine-tuned on Mayo Clinic notes; and (4) GPT-2 generative model few-shot training fine-tuned on Mayo Clinic notes to generate the answer by considering the context of included text. Results: The GPT-2 generative model was the best-performing model with an accuracy of 0.92[0.91 - 0.93] and R2 score of 0.89[0.87, 0.9], and all GPT2-based models outperformed the ClinicalBERT model in terms of the exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy 0.27[0.26 - 0.28], it demonstrated a high R2 score 0.88[0.85, 0.89], suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and the R2 score was higher than the GPT-2 QA zero-shot model or GPT-2 QA model few-shot training fine-tuned model. 
Conclusion: We developed a robust model based on a state-of-the-art large language model (LLM)- a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R2 score. It overcame several challenges related to different ways clinicians document headache frequency that were not easily achieved by traditional NLP models. We also showed that GPT2-based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code with open-source license of community use in GitHub.

10.

An aspect-level sentiment analysis dataset for therapies on Twitter.

Guo, Yuting; Das, Sudeshna; Lakamana, Sahithi; Sarker, Abeed.

Data Brief ; 50: 109618, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37808542

ABSTRACT

The dataset described is an aspect-level sentiment analysis dataset for therapies, including medication, behavioral and other therapies, created by leveraging user-generated text from Twitter. The dataset was constructed by collecting Twitter posts using keywords associated with the therapies (often referred to as treatments). Subsequently, subsets of the collected posts were manually reviewed, and annotation guidelines were developed to categorize the posts as positive, negative, or neutral. The dataset contains a total of 5364 posts mentioning 32 therapies. These posts are further categorized manually into 998 (18.6%) positive, 619 (11.5%) negative, and 3747 (69.9%) neutral sentiments. The inter-annotator agreement for the dataset was evaluated using Cohen's Kappa, achieving a score of 0.82. The potential use of this dataset lies in the development of automatic systems that can detect users' sentiments toward therapies based on their posts. While there are other sentiment analysis datasets available, this is the first that encodes sentiments associated with specific therapies. Researchers and developers can utilize this dataset to train sentiment analysis models, natural language processing algorithms, or machine learning systems to accurately identify and analyze the sentiments expressed by consumers on social media platforms like Twitter.
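
The class shares and the agreement statistic reported above can be reproduced in outline as follows. The class counts come from the abstract; the two short annotator label lists are hypothetical toy data, and `cohen_kappa` is an illustrative helper rather than the authors' code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Class counts as reported in the abstract.
class_counts = {"positive": 998, "negative": 619, "neutral": 3747}
total = sum(class_counts.values())
shares = {label: round(100 * count / total, 1) for label, count in class_counts.items()}
print(total, shares)

# Hypothetical annotations for six posts (not taken from the dataset).
annotator_1 = ["pos", "pos", "neg", "neu", "neu", "neu"]
annotator_2 = ["pos", "neg", "neg", "neu", "neu", "neu"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa discounts the agreement expected by chance, which matters here because roughly 70% of posts fall in the neutral class.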

11.

A framework for multi-faceted content analysis of social media chatter regarding non-medical use of prescription medications.

Raza, Shaina; Schwartz, Brian; Lakamana, Sahithi; Ge, Yao; Sarker, Abeed.

BMC Digit Health ; 1, 2023.

Article in English | MEDLINE | ID: mdl-37680768

ABSTRACT

Background: Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications. Methods: We collected Twitter data for four medications-fentanyl and morphine (opioids), alprazolam (benzodiazepine), and Adderall® (stimulant), and identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication. Results: The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities. Conclusions: NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.

12.

Few-shot learning for medical text: A review of advances, trends, and opportunities.

Ge, Yao; Guo, Yuting; Das, Sudeshna; Al-Garadi, Mohammed Ali; Sarker, Abeed.

J Biomed Inform ; 144: 104458, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37488023

ABSTRACT

BACKGROUND: Few-shot learning (FSL) is a class of machine learning methods that require small numbers of labeled instances for training. With many medical topics having limited annotated text-based data in practical settings, FSL-based natural language processing (NLP) holds substantial promise. We aimed to conduct a review to explore the current state of FSL methods for medical NLP. METHODS: We searched for articles published between January 2016 and October 2022 using PubMed/Medline, Embase, ACL Anthology, and IEEE Xplore Digital Library. We also searched the preprint servers (e.g., arXiv, medRxiv, and bioRxiv) via Google Scholar to identify the latest relevant methods. We included all articles that involved FSL and any form of medical text. We abstracted articles based on the data source, target task, training set size, primary method(s)/approach(es), and evaluation metric(s). RESULTS: Fifty-one articles met our inclusion criteria-all published after 2018, and most since 2020 (42/51; 82%). Concept extraction/named entity recognition was the most frequently addressed task (21/51; 41%), followed by text classification (16/51; 31%). Thirty-two (61%) articles reconstructed existing datasets to fit few-shot scenarios, and MIMIC-III was the most frequently used dataset (10/51; 20%). 77% of the articles attempted to incorporate prior knowledge to augment the small datasets available for training. Common methods included FSL with attention mechanisms (20/51; 39%), prototypical networks (11/51; 22%), meta-learning (7/51; 14%), and prompt-based learning methods, the latter being particularly popular since 2021. Benchmarking experiments demonstrated relative underperformance of FSL methods on biomedical NLP tasks. CONCLUSION: Despite the potential for FSL in biomedical NLP, progress has been limited. This may be attributed to the rarity of specialized data, lack of standardized evaluation criteria, and the underperformance of FSL methods on biomedical topics. 
The creation of publicly-available specialized datasets for biomedical FSL may aid method development by facilitating comparative analyses.

Subjects

Machine Learning, Natural Language Processing, PubMed, MEDLINE, Publications

13.

Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes.

Guo, Yuting; Al-Garadi, Mohammed A; Book, Wendy M; Ivey, Lindsey C; Rodriguez, Fred H; Raskind-Hood, Cheryl L; Robichaux, Chad; Sarker, Abeed.

J Am Heart Assoc ; 12(13): e030046, 2023 Jul 04.

Article in English | MEDLINE | ID: mdl-37345821

ABSTRACT

Background The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing-based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code-based classification. Methods and Results We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code-based classification on 20% of the held-out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79-0.83), 0.95 (95% CI, 0.92-0.97), and 0.89 (95% CI, 0.88-0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code-based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code-based classification produced more false positives. Conclusions Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
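
The abstract does not describe the sliding window strategy in detail, so the following is a generic sketch of the idea under stated assumptions: a long note is cut into overlapping windows that each fit a model's length limit, each window is scored separately, and the note-level score is the maximum over windows. The window size, stride, scores, and function names are all hypothetical.

```python
def sliding_windows(tokens, window_size, stride):
    """Split a token list into windows of at most window_size, advancing by stride.

    Choosing stride <= window_size makes consecutive windows overlap (or touch),
    so no token is skipped.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if len(tokens) <= window_size:
        return [list(tokens)]
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


def note_level_score(window_scores):
    """Aggregate per-window positive-class scores; max pooling is one option."""
    return max(window_scores)


# Hypothetical note of 10 tokens, with a toy limit of 4 tokens per window.
tokens = list(range(10))
windows = sliding_windows(tokens, window_size=4, stride=3)
print(windows)

# Hypothetical per-window classifier scores for the positive class.
score = note_level_score([0.10, 0.85, 0.20])
print(score)
```

Max pooling flags a note when any single window looks positive; mean pooling or voting across windows are alternative aggregation choices.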

Subjects

International Classification of Diseases, Natural Language Processing, Humans, Machine Learning, Electronic Health Records, Electronics

14.

Generalizable Natural Language Processing Framework for Migraine Reporting from Social Media.

Guo, Yuting; Rajwal, Swati; Lakamana, Sahithi; Chiang, Chia-Chun; Menell, Paul C; Shahid, Adnan H; Chen, Yi-Chieh; Chhabra, Nikita; Chao, Wan-Ju; Chao, Chieh-Ju; Schwedt, Todd J; Banerjee, Imon; Sarker, Abeed.

AMIA Jt Summits Transl Sci Proc ; 2023: 261-270, 2023.

Article in English | MEDLINE | ID: mdl-37350878

ABSTRACT

Migraine is a highly prevalent and disabling neurological disorder. However, information about migraine management in real-world settings is limited to traditional health information sources. In this paper, we (i) verify that there is substantial migraine-related chatter available on social media (Twitter and Reddit), self-reported by those with migraine; (ii) develop a platform-independent text classification system for automatically detecting self-reported migraine-related posts, and (iii) conduct analyses of the self-reported posts to assess the utility of social media for studying this problem. We manually annotated 5750 Twitter posts and 302 Reddit posts, and used them for training and evaluating supervised machine learning methods. Our best system achieved an F1 score of 0.90 on Twitter and 0.93 on Reddit. Analysis of information posted by our 'migraine cohort' revealed the presence of a plethora of relevant information about migraine therapies and sentiments associated with them. Our study forms the foundation for conducting an in-depth analysis of migraine-related information using social media data.

15.

Automatic Detection of Intimate Partner Violence Victims from Social Media for Proactive Delivery of Support.

Guo, Yuting; Kim, Sangmi; Warren, Elise; Yang, Yuan-Chi; Lakamana, Sahithi; Sarker, Abeed.

AMIA Jt Summits Transl Sci Proc ; 2023: 254-260, 2023.

Article in English | MEDLINE | ID: mdl-37351791

ABSTRACT

Social media platforms are increasingly being used by intimate partner violence (IPV) victims to share experiences and seek support. If such information is automatically curated, it may be possible to conduct social media based surveillance and even design interventions over such platforms. In this paper, we describe the development of a supervised classification system that automatically characterizes IPV-related posts on the social network Reddit. We collected data from four IPV-related subreddits and manually annotated the data to indicate whether a post is a self-report of IPV or not. Using the annotated data (N=289), we trained, evaluated, and compared supervised machine learning systems. A transformer-based classifier, RoBERTa, obtained the best classification performance with overall accuracy of 78% and IPV-self-report class F1-score of 0.67. Post-classification error analyses revealed that misclassifications often occur for posts that are very long or are non-first-person reports of IPV. Despite the relatively small annotated data, our classification methods obtained promising results, indicating that it may be possible to detect and, hence, provide support to IPV victims over Reddit.
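
The gap between overall accuracy and the class-specific F1-score reported above is typical when the class of interest is the minority. A minimal sketch of the two metrics, using hypothetical confusion-matrix counts rather than the study's results:

```python
def accuracy(tp, fp, fn, tn):
    """Share of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for 10 posts, 3 of which are true self-reports.
tp, fp, fn, tn = 2, 1, 1, 6

overall_accuracy = accuracy(tp, fp, fn, tn)
positive_f1 = f1_score(tp, fp, fn)
print(overall_accuracy, positive_f1)
```

Because true negatives count toward accuracy but not toward the positive-class F1, a classifier can look stronger on the first metric than on the second.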

16.

Barriers to opioid use disorder treatment: A comparison of self-reported information from social media with barriers found in literature.

Bremer, Whitney; Plaisance, Karma; Walker, Drew; Bonn, Matthew; Love, Jennifer S; Perrone, Jeanmarie; Sarker, Abeed.

Front Public Health ; 11: 1141093, 2023.

Article in English | MEDLINE | ID: mdl-37151596

ABSTRACT

Introduction: Medications such as buprenorphine and methadone are effective for treating opioid use disorder (OUD), but many patients face barriers related to treatment and access. We analyzed two sources of data-social media and published literature-to categorize and quantify such barriers. Methods: In this mixed methods study, we analyzed social media (Reddit) posts from three OUD-related forums (subreddits): r/suboxone, r/Methadone, and r/naltrexone. We applied natural language processing to identify posts relevant to treatment barriers, categorized them into insurance- and non-insurance-related, and manually subcategorized them into fine-grained topics. For comparison, we used substance use-, OUD- and barrier-related keywords to identify relevant articles from PubMed published between 2006 and 2022. We searched publications for language expressing fear of barriers, and hesitation or disinterest in medication treatment because of barriers, paying particular attention to the affected population groups described. Results: On social media, the top three insurance-related barriers included having no insurance (22.5%), insurance not covering OUD treatment (24.7%), and general difficulties of using insurance for OUD treatment (38.2%); while the top two non-insurance-related barriers included stigma (47.6%), and financial difficulties (26.2%). For published literature, stigma was the most prominently reported barrier, occurring in 78.9% of the publications reviewed, followed by financial and/or logistical issues to receiving medication treatment (73.7%), gender-specific barriers (36.8%), and fear (31.5%). Conclusion: The stigma associated with OUD and/or seeking treatment and insurance/cost are the two most common types of barriers reported in the two sources combined. Harm reduction efforts addressing barriers to recovery may benefit from leveraging multiple data sources.

Subjects

Opioid-Related Disorders, Social Media, Humans, Self Report, Opiate Substitution Treatment/methods, Opioid-Related Disorders/drug therapy, Methadone/therapeutic use

17.

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection.

Sarker, Abeed; Lakamana, Sahithi; Liao, Ruqi; Abbas, Aamir; Yang, Yuan-Chi; Al-Garadi, Mohammed.

JMIR Infodemiology ; 3: e43694, 2023.

Article in English | MEDLINE | ID: mdl-37113382

ABSTRACT

Background: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Objective: Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. Methods: We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. Results: FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics. Conclusions: Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network-based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.
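
Illustrative note: the intuition described in the abstract above (an anomalous rise in chatter volume for a key phrase, compared against the FDA letter date) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the authors' method: the abstract does not name the anomaly detector used, so a trailing-window z-score rule is swapped in here, and the function names, window size, threshold, post counts, and letter date are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def first_anomaly_date(daily_counts, window=7, threshold=3.0):
    """Return the first date whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations, or None if none does.

    daily_counts: list of (date, count) pairs sorted by date.
    The z-score rule is an assumption, not the detector from the paper.
    """
    for i in range(window, len(daily_counts)):
        history = [count for _, count in daily_counts[i - window:i]]
        mu = mean(history)
        # Floor the deviation at 1.0 so a perfectly flat history
        # does not flag a one-post increase as an anomaly.
        sigma = max(stdev(history), 1.0)
        day, count = daily_counts[i]
        if count > mu + threshold * sigma:
            return day
    return None


def lead_time_days(signal_date, letter_date):
    """Days by which the signal precedes the letter (negative if later)."""
    return (letter_date - signal_date).days


# Made-up daily post counts for one hypothetical key phrase.
start = date(2020, 3, 1)
counts = [5, 6, 4, 5, 7, 5, 6, 5, 6, 40, 55, 30]
series = [(start + timedelta(days=i), c) for i, c in enumerate(counts)]

signal = first_anomaly_date(series)
letter = date(2020, 3, 20)  # hypothetical warning-letter date
print(signal, lead_time_days(signal, letter))
```

A positive lead time corresponds to the "earlier than the FDA letter" outcome reported in the abstract; a value between -7 and 0 would correspond to a signal within a week after the letter.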

18.

Characteristics of Intimate Partner Violence and Survivor's Needs During the COVID-19 Pandemic: Insights From Subreddits Related to Intimate Partner Violence.

Kim, Sangmi; Warren, Elise; Jahangir, Tasfia; Al-Garadi, Mohammed; Guo, Yuting; Yang, Yuan-Chi; Lakamana, Sahithi; Sarker, Abeed.

J Interpers Violence ; 38(17-18): 9693-9716, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37102576

ABSTRACT

Intimate partner violence (IPV) increased during the COVID-19 pandemic. Collecting actionable IPV-related data from conventional sources (e.g., medical records) was challenging during the pandemic, generating a need to obtain relevant data from non-conventional sources, such as social media. Social media, like Reddit, is a preferred medium of communication for IPV survivors to share their experiences and seek support with protected anonymity. Nevertheless, the scope of available IPV-related data on social media is rarely documented. Thus, we examined the availability of IPV-related information on Reddit and the characteristics of the reported IPV during the pandemic. Using natural language processing, we collected publicly available Reddit data from four IPV-related subreddits between January 1, 2020 and March 31, 2021. Of 4,000 collected posts, we randomly sampled 300 posts for analysis. Three individuals on the research team independently coded the data and resolved the coding discrepancies through discussions. We adopted quantitative content analysis and calculated the frequency of the identified codes. 36% of the posts (n = 108) constituted self-reported IPV by survivors, of which 40% regarded current/ongoing IPV, and 14% contained help-seeking messages. A majority of the survivors' posts reflected psychological aggression, followed by physical violence. Notably, 61.4% of the psychological aggression involved expressive aggression, followed by gaslighting (54.3%) and coercive control (44.3%). Survivors' top three needs during the pandemic were hearing similar experiences, legal advice, and validating their feelings/reactions/thoughts/actions. Albeit limited, data from bystanders (survivors' friends, family, or neighbors) were also available. Rich data reflecting IPV survivors' lived experiences were available on Reddit. Such information will be useful for IPV surveillance, prevention, and intervention.

Subjects

COVID-19, Intimate Partner Violence, Humans, Pandemics, Intimate Partner Violence/psychology, Coercion, Survivors/psychology

19.

An analysis of cannabinoid hyperemesis syndrome Reddit posts and themes.

Wightman, Rachel S; Perrone, Jeanmarie; Collins, Alexandra B; Lakamana, Sahithi; Sarker, Abeed.

Clin Toxicol (Phila) ; 61(4): 283-289, 2023 Apr.

Article in English | MEDLINE | ID: mdl-37014024

ABSTRACT

INTRODUCTION: Reddit hosts a large active community of members dedicated to the discussion of cannabinoid hyperemesis syndrome. We sought to describe common themes discussed and the most frequently mentioned triggers and therapies for cannabinoid hyperemesis syndrome exacerbations in the Reddit online community. METHODS: Data collected from six subreddits were filtered using natural language processing to curate posts referencing cannabinoid hyperemesis syndrome. Based on a manual review of posts, common themes were identified. A machine learning model was trained using the manually categorized data to automatically classify the themes for the rest of the posts so that their distributions could be quantified. RESULTS: From August 2018 to November 2022, 2683 unique posts were collected. Thematic analysis resulted in five overall themes: cannabinoid hyperemesis syndrome-related science; symptom timing; cannabinoid hyperemesis syndrome treatment and prevention; cannabinoid hyperemesis syndrome diagnosis and education; and health impacts. Additionally, 447 trigger and 664 therapy-related posts were identified. The most commonly mentioned triggers for cannabinoid hyperemesis syndrome episodes included: food and drink (n = 62), cannabinoids (n = 45), mental health (e.g., stress, anxiety) (n = 27), and alcohol (n = 22). Most commonly mentioned cannabinoid hyperemesis syndrome therapies included: hot water/bathing (n = 62), hydration (n = 60), antiemetics (n = 42), food and drink (n = 38), gastrointestinal medications (n = 38), behavioral therapies (e.g., meditation, yoga) (n = 35), and capsaicin (n = 29). DISCUSSION: Reddit posts for cannabinoid hyperemesis syndrome provide a valuable source of community discussion and individual reports of people experiencing cannabinoid hyperemesis syndrome. Mental health and alcohol were frequently reported triggers within the posts but are not often identified in the literature. While many of the therapies mentioned are well documented, behavioral responses such as meditation and yoga have not been explored by the scientific literature. CONCLUSIONS: Knowledge shared via online social media platforms contains detailed information on self-reported cannabinoid hyperemesis syndrome disease and management experiences, which could serve as valuable data for the development of treatment strategies. Further longitudinal studies in patients with cannabinoid hyperemesis syndrome are needed to corroborate these findings.

Subjects

Antiemetics, Cannabinoids, Marijuana Abuse, Humans, Cannabinoids/adverse effects, Vomiting/drug therapy, Antiemetics/therapeutic use, Anxiety, Syndrome, Marijuana Abuse/drug therapy

20.

Self-reported Xylazine Experiences: A Mixed Methods Study of Reddit Subscribers.

Spadaro, Anthony; O'Connor, Karen; Lakamana, Sahithi; Sarker, Abeed; Wightman, Rachel; Love, Jennifer S; Perrone, Jeanmarie.

medRxiv ; 2023 Mar 14.

Article in English | MEDLINE | ID: mdl-36993695

ABSTRACT

Objectives: Xylazine is an alpha-2 agonist increasingly prevalent in the illicit drug supply. Our objectives were to curate information about xylazine through social media from People Who Use Drugs (PWUDs). Specifically, we sought to answer the following: 1) what are the demographics of Reddit subscribers reporting exposure to xylazine? 2) is xylazine a desired additive? and 3) what adverse effects of xylazine are PWUDs experiencing? Methods: Natural Language Processing (NLP) was used to identify mentions of "xylazine" from posts by Reddit subscribers who also posted on drug-related subreddits. Posts were qualitatively evaluated for xylazine-related themes. A survey was developed to gather additional information about the Reddit subscribers. This survey was posted on subreddits that were identified by NLP to contain xylazine-related discussions from March 2022 to October 2022. Results: 76 posts mentioning xylazine were extracted via NLP from 765,616 posts by 16,131 Reddit subscribers (January 2018 to August 2021). People on Reddit described xylazine as an unwanted adulterant in their opioid supply. 61 participants completed the survey. Of those who disclosed their location, 25/50 (50%) participants reported locations in the Northeastern United States. The most common route of xylazine use was intranasal use (57%). 31/59 (53%) reported experiencing xylazine withdrawal. Frequent adverse events reported were prolonged sedation (81%) and increased skin wounds (43%). Conclusions: Among respondents on these Reddit forums, xylazine appears to be an unwanted adulterant. PWUDs may be experiencing adverse effects such as prolonged sedation and xylazine withdrawal. This appeared to be more common in the Northeast.
